Swedish changes by dresen · Pull Request #1242 · kaldi-asr/kaldi

dresen · 2016-12-02T14:36:27Z

The update in this PR makes the modifications to sprakbanken that were requested for sprakbanken_swe, makes the python scripts work with python 2.7.x, simplifies the recipe and gives better results. Because I have changed the data preprocessing, a new lexicon needs to be uploaded to openslr, but I cannot attach it to the PR.

…h python 2 and 3 on the request of @jtrmal (I think they are slower now because we use more regexes). Changed the preprocessing so case is not normalised and altered default behaviour to delete sentence-final '.' rather than convert to a token because it is more often the case that they are not spoken aloud.
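The preprocessing change described in this commit note (case left as-is, sentence-final '.' deleted by default rather than turned into a token) could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in normalize_transcript.py: the function name, the `period_token` parameter and the example token PUNKTUM are made up for illustration.

```python
import re

# Hypothetical helper, not taken from the recipe's scripts.
# Matches a '.' at the very end of the line, plus surrounding whitespace.
_FINAL_PERIOD = re.compile(r"\s*\.\s*$")


def normalize_line(line, period_token=None):
    """Handle a sentence-final '.' without changing the case of the text.

    With period_token=None the final '.' is deleted (the default behaviour
    described in the PR). Passing a token replaces the '.' with that token.
    """
    if period_token is None:
        return _FINAL_PERIOD.sub("", line)
    return _FINAL_PERIOD.sub(" " + period_token, line)


print(normalize_line("Det er Andreas."))
print(normalize_line("Det er Andreas.", period_token="PUNKTUM"))
```

Only a period at the end of the line is affected, so abbreviations inside the sentence are left alone. The sketch uses nothing beyond the `re` module, so it should run unchanged under both python 2.7 and python 3.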

danpovey · 2016-12-02T18:44:49Z

Thanks... let's wait until the lexicon is available at openslr before merging it. In general we don't like to overwrite files at openslr if they have been there a while, but this isn't a hard-and-fast rule. Did you plan for the new lexicon to have the same filename, and what are the differences from the old lexicon? I'm wondering whether we should give it a different filename.

On Fri, Dec 2, 2016 at 9:36 AM, Andreas Søeborg Kirkedal < ***@***.***> wrote:

The update in this PR makes the modifications to sprakbanken that were requested for sprakbanken_swe, makes the python scripts work with python 2.7.x, simplifies the recipe and gives better results. Because I have changed the data preprocessing, a new lexicon needs to be uploaded to openslr, but I cannot attach it to the PR.

Commit Summary

- Merge pull request #4 from kaldi-asr/master
- Made the same modifications to sprakbanken as @jtrmal suggested for sprakbanken_swe and removed deprecated commands from run.sh
- Modified python scripts called by sprak_data_prep.sh so they work with python 2 and 3 on the request of @jtrmal (I think they are slower now because we use more regexes). Changed the preprocessing so case is not normalised and altered default behaviour to delete sentence-final '.' rather than convert to a token because it is more often the case that they are not spoken aloud.
- Modified run.sh and tuned #leaves and #Gauss on dev set for for GMM-based systems. Changed the scoring scripts in local/ to be similar to WSJ to get better analyses and changed the local/wer* scripts to fit this recipe.
- Modify the filters in local/wer_* so they remove accents and umlauts, but particular Danish characters. Corrected error in previous commit that changes openfst version tools/Makefile

File Changes

- *M* egs/sprakbanken/s5/local/copy_dict.sh <https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-0> (6)
- *M* egs/sprakbanken/s5/local/create_datasets.sh <https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-1> (2)
- *M* egs/sprakbanken/s5/local/dict_prep.sh <https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-2> (129)
- *M* egs/sprakbanken/s5/local/norm_dk/format_text.sh <https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-3> (11)
- *A* egs/sprakbanken/s5/local/norm_dk/numbersLow.tbl <https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-4> (265)
- *M* egs/sprakbanken/s5/local/normalize_transcript.py <https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-5> (17)
- *M* egs/sprakbanken/s5/local/normalize_transcript_prefixed.py <https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-6> (30)
- *M* egs/sprakbanken/s5/local/score.sh <https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-7> (124)
- *M* egs/sprakbanken/s5/local/sprak_data_prep.sh <https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-8> (62)
- *A* egs/sprakbanken/s5/local/wer_hyp_filter <https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-9> (5)
- *A* egs/sprakbanken/s5/local/wer_output_filter <https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-10> (5)
- *A* egs/sprakbanken/s5/local/wer_ref_filter <https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-11> (5)
- *M* egs/sprakbanken/s5/local/writenumbers.py <https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-12> (1)
- *M* egs/sprakbanken/s5/run.sh <https://github.com/kaldi-asr/kaldi/pull/1242/files#diff-13> (311)

Patch Links:

- https://github.com/kaldi-asr/kaldi/pull/1242.patch
- https://github.com/kaldi-asr/kaldi/pull/1242.diff

dresen · 2016-12-03T19:42:35Z

The words in the new lexicon are not case normalised. Otherwise, the old and new versions are the same. I had thought to just replace the old lexicon with the new one, but if you would like to keep the old version, I can rename the new one to e.g. lexicon-da-nonorm.tgz

danpovey · 2016-12-03T19:43:30Z

Yes, please rename, and email Yenda separately with the new file.

jtrmal · 2016-12-05T14:46:10Z

Lexicon published -- http://www.openslr.org/8/

danpovey · 2016-12-05T16:48:33Z

Andreas, let me know when the recipe is ready to check (e.g. the filename matches the one in openslr).

dresen · 2016-12-06T14:10:42Z

Ready for review, Dan.

-- Kind regards, Andreas Søeborg Kirkedal

danpovey · 2016-12-11T06:16:55Z

egs/sprakbanken/s5/local/dict_prep.sh

+dictdir=data/local/dict
 espeakdir='espeak-1.48.04-source'
-mkdir -p $dir
+mkdir -p $dictsrc $dictd  ir


There seems to be a space in the middle of a word ($dictd  ir should presumably be $dictdir).
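To see why the stray space matters: the shell splits a command line on whitespace, so `ir` becomes an argument of its own and `$dictd` is read as a different variable from `$dictdir`. The following is a minimal sketch using Python's `shlex` module, which approximates POSIX word splitting but does not expand variables; the `broken` and `fixed` strings are the line from the diff and its presumed correction.

```python
import shlex

# The line as it appears in the diff, and the presumed intended version.
broken = "mkdir -p $dictsrc $dictd  ir"
fixed = "mkdir -p $dictsrc $dictdir"

# shlex.split tokenizes on whitespace like a POSIX shell (no expansion).
broken_args = shlex.split(broken)
fixed_args = shlex.split(fixed)

print(broken_args)  # ['mkdir', '-p', '$dictsrc', '$dictd', 'ir']
print(fixed_args)   # ['mkdir', '-p', '$dictsrc', '$dictdir']
```

If `$dictd` is unset, mkdir would presumably create `$dictsrc` plus a directory literally named `ir` in the current directory, and data/local/dict would not be created.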

danpovey · 2016-12-14T01:45:16Z

There is a conflict, can you please merge and resolve?

dresen added 6 commits November 10, 2016 09:07

Merge pull request #4 from kaldi-asr/master

1472b0b

Update from original

Made the same modifications to sprakbanken as @jtrmal suggested for s…

943aa36

…prakbanken_swe and removed deprecated commands from run.sh

Modified run.sh and tuned #leaves and #Gauss on dev set for for GMM-b…

9c811ce

…ased systems. Changed the scoring scripts in local/ to be similar to WSJ to get better analyses and changed the local/wer* scripts to fit this recipe.

Modify the filters in local/wer_* so they remove accents and umlauts,…

efa3205

… but particular Danish characters. Corrected error in previous commit that changes openfst version tools/Makefile

Merge pull request #5 from kaldi-asr/master

e840588

Update from original

Added new lexicon from openslr to copy_dict.sh and bugfix in run.sh

ae1fef3
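The filter behaviour described in commit efa3205 above (remove accents and umlauts while leaving particular Danish characters alone) can be illustrated with a short sketch. It is not the content of local/wer_hyp_filter or the other local/wer_* files, which are not shown in this thread. Which characters are preserved is an assumption here (æ, ø and å in both cases), and the function name is hypothetical.

```python
import unicodedata

# Assumption: the "particular Danish characters" are æ, ø and å.
DANISH_KEEP = set("æøåÆØÅ")


def strip_diacritics(text, keep=DANISH_KEEP):
    """Remove accents and umlauts, leaving characters in `keep` untouched."""
    # Compose first so that 'a' + combining ring is seen as a single 'å'.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch in keep:
            out.append(ch)
            continue
        # Decompose into base character + combining marks, drop the marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(strip_diacritics("idé für smørrebrød på Århus"))
```

Applying the same mapping to both reference and hypothesis text before scoring means that a difference such as "idé" versus "ide" is not counted as a word error. The sketch is written for python 3.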

danpovey reviewed Dec 11, 2016

dresen added 2 commits December 12, 2016 03:32

Remove space

d867569

Changed number of parallel deode jobs to match number of speakers in …

b7b53fa

…dev set

dresen added 3 commits December 14, 2016 05:19

Merge branch 'master' into swedish-changes

ac5d12c

Merge pull request #6 from kaldi-asr/master

b0eba63

Merging to resolve conflict

Resolved conflict with master

114a514

danpovey merged commit f6b82ad into kaldi-asr:master Dec 15, 2016

dresen deleted the swedish-changes branch December 15, 2016 07:57

dresen added a commit to dresen/kaldi that referenced this pull request Dec 15, 2016

Merge pull request #7 from kaldi-asr/master

bec69c2

Swedish changes (kaldi-asr#1242)

danpovey merged 12 commits into kaldi-asr:master from dresen:swedish-changes
